neural text-to-speech
Large-Scale Automatic Audiobook Creation
Walsh, Brendan, Hamilton, Mark, Newby, Greg, Wang, Xi, Ruan, Serena, Zhao, Sheng, He, Lei, Zhang, Shaofei, Dettinger, Eric, Freeman, William T., Weimer, Markus
An audiobook can dramatically improve a work of literature's accessibility and improve reader engagement. However, audiobooks can take hundreds of hours of human effort to create, edit, and publish. In this work, we present a system that can automatically generate high-quality audiobooks from online e-books. In particular, we leverage recent advances in neural text-to-speech to create and release thousands of human-quality, open-license audiobooks from the Project Gutenberg e-book collection. Our method can identify the proper subset of e-book content to read for a wide collection of diversely structured books and can operate on hundreds of books in parallel. Our system allows users to customize an audiobook's speaking speed and style, emotional intonation, and can even match a desired voice using a small amount of sample audio. This work contributed over five thousand open-license audiobooks and an interactive demo that allows users to quickly create their own customized audiobooks. To listen to the audiobook collection, visit https://aka.ms/audiobook.
GitHub - jaketae/storyteller: Multimodal AI Story Teller, built with Stable Diffusion, GPT, and neural text-to-speech
A multimodal AI story teller, built with Stable Diffusion, GPT, and neural text-to-speech (TTS). Given a prompt as an opening line of a story, GPT writes the rest of the plot; Stable Diffusion draws an image for each sentence; a TTS model narrates each line, resulting in a fully animated video of a short story, replete with audio and visuals.
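The three stages described above (write, illustrate, narrate) can be sketched as a small pipeline. This is a minimal illustration, not the storyteller package's actual API: `tell_story`, `Scene`, `split_sentences`, and the three stand-in model functions are hypothetical names, and the "models" simply return placeholder strings so the sketch runs without any weights.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scene:
    """One sentence of the story with its (placeholder) image and audio."""
    sentence: str
    image: str
    audio: str


def split_sentences(text: str) -> List[str]:
    # Naive split on whitespace that follows ., ! or ?; a real system
    # would likely use a proper sentence tokenizer.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tell_story(
    prompt: str,
    write: Callable[[str], str],
    paint: Callable[[str], str],
    speak: Callable[[str], str],
) -> List[Scene]:
    # 1. The language model continues the prompt into a full story.
    story = write(prompt)
    # 2. and 3. Each sentence gets one image and one narration clip.
    return [Scene(s, paint(s), speak(s)) for s in split_sentences(story)]


# Hypothetical stand-ins for GPT, Stable Diffusion, and a TTS model.
def fake_writer(prompt: str) -> str:
    return prompt + " The dragon woke. It flew away."


def fake_painter(sentence: str) -> str:
    return f"<image for: {sentence}>"


def fake_speaker(sentence: str) -> str:
    return f"<audio for: {sentence}>"


scenes = tell_story(
    "Once upon a time, there was a dragon.",
    fake_writer,
    fake_painter,
    fake_speaker,
)
for scene in scenes:
    print(scene.sentence, scene.image, scene.audio)
```

Passing the models in as callables keeps the per-sentence loop independent of any particular library, so each stand-in could be swapped for a real model call without touching `tell_story`.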
Enhancing audio quality for expressive Neural Text-to-Speech
Ezzerg, Abdelhamid, Gabrys, Adam, Putrycz, Bartosz, Korzekwa, Daniel, Saez-Trigueros, Daniel, McHardy, David, Pokora, Kamil, Lachowicz, Jakub, Lorenzo-Trueba, Jaime, Klimkov, Viacheslav
Artificial speech synthesis has made a great leap in terms of naturalness as recent Text-to-Speech (TTS) systems are capable of producing speech with similar quality to human recordings. However, not all speaking styles are easy to model: highly expressive voices are still challenging even to recent TTS architectures since there seems to be a trade-off between expressiveness in a generated audio and its signal quality. In this paper, we present a set of techniques that can be leveraged to enhance the signal quality of a highly-expressive voice without the use of additional data. The proposed techniques include: tuning the autoregressive loop's granularity during training; using Generative Adversarial Networks in acoustic modeling; and the use of Variational Auto-Encoders in both the acoustic model and the neural vocoder. We show that, when combined, these techniques greatly closed the gap in perceived naturalness between the baseline system and recordings by 39% in terms of ...
Figure 1: Overview of model architecture. The system can be broken into two parts: an acoustic model and a neural vocoder that produces waveform. Orange blocks highlight the building neural network blocks for the acoustic model while the neural vocoder is represented by a blue box.
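A statement such as "closed the gap between the baseline and recordings by 39%" is commonly read as the share of the baseline-to-recordings score difference that the improved system recovers. The sketch below works through that arithmetic under this assumed reading; the function name `gap_closed` and the three scores are hypothetical and are not taken from the paper.

```python
def gap_closed(baseline: float, system: float, reference: float) -> float:
    """Fraction of the baseline-to-reference gap recovered by `system`."""
    gap = reference - baseline
    if gap <= 0:
        raise ValueError("reference must score above the baseline")
    return (system - baseline) / gap


# Hypothetical listening-test scores, chosen only to illustrate the formula.
fraction = gap_closed(baseline=60.0, system=67.8, reference=80.0)
print(f"gap closed: {fraction:.0%}")
```

With these made-up scores the system recovers 7.8 of the 20 points separating the baseline from the recordings, which comes out to roughly 39%.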
AWS Polly gains neural voices in U.S. Spanish and Brazilian Portuguese
Months after Amazon made Neural Text-To-Speech (NTTS) and the newscaster style generally available in Amazon Polly, a cloud service that converts text into speech, the Seattle company today debuted two new NTTS voices in U.S. Spanish and Brazilian Portuguese: "Lupe" and "Camila." Like the U.S. English NTTS voice before them, they mimic things like stress and intonation in speech by identifying tonal patterns. Neural versions of Camila and Lupe are available in Amazon Web Services' (AWS) U.S. East (N. Standard variants are also available across 18 AWS regions, bringing Polly's total number of voices to 61 across 29 languages and the total number of voices available in both standard and neural versions to 13 across four languages. According to Amazon text-to-speech program manager Marta Smolarek, the new U.S. Spanish voice -- Lupe, which is the third U.S. text-to-speech voice in Polly -- not only speaks Spanish but also handles English and provides a fully bilingual Spanish-English experience.
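For readers who want to try the voices named above, the following is a minimal sketch of requesting neural speech from Polly through boto3's `synthesize_speech` call. The helper names `build_polly_request` and `synthesize_to_file`, the sample text, and the `us-east-1` region default are assumptions made for illustration; running the second helper also requires boto3 and valid AWS credentials.

```python
from typing import Dict


def build_polly_request(text: str, voice_id: str, engine: str = "neural") -> Dict[str, str]:
    """Build keyword arguments for Polly's synthesize_speech call."""
    # This sketch only covers the two engines the article discusses.
    if engine not in ("standard", "neural"):
        raise ValueError(f"unsupported engine for this sketch: {engine}")
    return {
        "Text": text,
        "OutputFormat": "mp3",
        "VoiceId": voice_id,
        "Engine": engine,
    }


def synthesize_to_file(text: str, voice_id: str, path: str,
                       region_name: str = "us-east-1") -> None:
    """Call Polly and write the returned audio stream to `path`."""
    # Imported lazily so the request builder works without boto3 installed.
    import boto3

    # The region is an assumption; neural voices are only offered in some regions.
    client = boto3.client("polly", region_name=region_name)
    response = client.synthesize_speech(**build_polly_request(text, voice_id))
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())


request = build_polly_request("Hola, bienvenidos.", "Lupe")
print(request)
```

Switching `engine` to `"standard"` requests the non-neural variant of the same voice, which the article notes is available in more regions.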
Alexa is now programmed to sound like a real-life news anchor
Amazon Alexa has been programmed to read the news headlines in the style of a newsreader. The popular voice assistant will now emphasise words, and mimic the intonation and pace of a TV anchor to present the news in a more natural way. Newsreader Alexa has been trained to read the daily bulletins when the user says 'Alexa, what's the latest?' The virtual assistant was already able to read out the headlines, but only in its traditional robotic voice. Amazon conducted tests and found that people preferred hearing the news in this more realistic and listener-friendly manner, compared to the robotic tone.